Selection Bias, Label Bias, and Bias in Ground Truth

Authors

  • Anders Søgaard
  • Barbara Plank
  • Dirk Hovy
Abstract

Language technology is biased toward English newswire. In POS tagging, we get 97–98 words out of 100 right in English newswire, but results drop to about 8 out of 10 when the same technology is run on Twitter data. In dependency parsing, we can identify the syntactic head of 9 out of 10 words in English newswire, but only 6–7 out of 10 in tweets. Replace the references to Twitter with references to a low-resource language of your choice, and these statements are still likely to hold. The reason for this bias is that mainstream language technology is data-driven, based on supervised statistical learning techniques, and annotated data resources are widely available for English newswire. Applying off-the-shelf language technology, induced from annotated newswire corpora, to something like Twitter is a bit like trying to predict elections from Xbox surveys (Wang et al., 2013): our induced models suffer from a data selection bias. This is not the only way our data is biased. The available resources for English newswire are the result of human annotators following specific guidelines. Humans err, leading to label bias, but more importantly, annotation guidelines typically make debatable linguistic choices. Linguistics is not an exact science, and we call the influence of annotation guidelines bias in ground truth. In the tutorial, we present case studies for each kind of bias and show several methods for dealing with it, which result in improved performance of NLP systems.

Publication date: 2014